(57) A system, method, and various software products provide for improved information retrieval in very
large document databases through the use of a predetermined static cache. The static cache includes, for
terms that appear in a large number of documents, a plurality of documents ordered by a contribution that
the term makes to the document score of the document. The contribution is a scalar measure of the influence
of the term in the computed document score. The contribution reflects both the within-document frequency
and the between-document frequency of the term. In addition, the static cache includes for each term a
lookup table that references selected entries for the term in an inverted index. Queries to the database
are then processed by first traversing the static cache, obtaining the contribution information therefrom,
and computing the document score from this information. Additional term frequency information for other
terms in the query is obtained by looking up the document in the lookup tables of the other query terms
and obtaining the term frequency information for such terms from the inverted index, or by searching the
contribution caches of the query terms.
Description

BACKGROUND 

Field of Invention

The present invention relates to systems and methods for computer based text retrieval, and more particularly, to
systems and methods for text or information retrieval from very large text databases.

Background of Invention

An ever increasing amount of information is becoming available electronically, particularly through wide-area
networks such as the Internet. The Internet and its various document collections as found in USENET, the World Wide
Web, and various FTP and similar sites, is perhaps the largest collection of full-text information available. Already, tens
of millions of documents are available in various document databases on the Internet. Performing rapid searches for
information on the Internet already requires expensive, high performance computers with vast quantities of RAM and fast
disk drives. Even worse, the Internet is rapidly growing. Some estimates claim that the amount of information available
on the Internet doubles every four months. Effective computer performance doubles only every 18 to 24 months, and
the cost per megabyte of storage improves even more slowly. To continue to scale with the growth of the Internet, then,
dramatic improvements in full-text retrieval methods are necessary in order to provide search results of relevant
documents in an efficient and timely manner.

Typical information retrieval systems use an "inverted index" database format. For each unique term in the
document database, the inverted index stores or identifies the documents which contain the term and a measure of the
frequency of the term within each document. Term frequency may be measured in various manners, such as raw term
counts, and various logarithmic functions thereof. Each document in the database has a unique document number, and
the entries for each term in the inverted index are typically sorted by document number so that multiple rows (terms)
can be efficiently compared by iterating over the rows in parallel.

Conventional search systems process a query by scoring documents in the database according to term frequency
information contained in the inverted index. The terms in the query are used to identify the relevant rows in the inverted
index. These rows are then traversed, and document scores computed for each of the listed documents. Most such
scoring functions are based on a between-document term frequency, called the inverse document frequency
(IDF) of each term, that reflects the frequency of occurrence of the term within a document database; a within-document
term frequency that reflects the frequency of a term in each document; and a normalization factor, typically the length
of the document vector. Such a scoring function may be:

$$S_D = \sum_{q} W_q \cdot \frac{f_{D,q} \cdot IDF_q}{l_D} \qquad (1)$$

where S_D is the document score for document D, q iterates over each term of the query, W_q is a weight for term q, IDF_q
is the IDF of term q in the document database, f_{D,q} is the frequency of term q in document D, and
l_D is the normalization factor for document D, typically the length of the vector represented by the document.
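
As an illustration of equation (1), the following minimal Python sketch scores documents by walking the inverted index rows of the query terms. The data layout (a dict mapping each term to a list of (document number, frequency) tuples) and the names `score_conventional`, `inverted_index`, `idf`, and `doc_length` are assumptions made for this example; the text does not prescribe them.

```python
from collections import defaultdict

def score_conventional(query_weights, inverted_index, idf, doc_length):
    """Score documents per equation (1): S_D = sum_q W_q * f_Dq * IDF_q / l_D."""
    scores = defaultdict(float)
    for term, weight in query_weights.items():
        # Every (document, frequency) tuple in the term's row is read and processed.
        for doc_id, freq in inverted_index.get(term, []):
            scores[doc_id] += weight * freq * idf[term] / doc_length[doc_id]
    return dict(scores)

# Hypothetical toy data; rows are sorted by document number.
inverted_index = {"apple": [(1, 2), (3, 1)], "orange": [(3, 4)]}
idf = {"apple": 1.0, "orange": 2.0}
doc_length = {1: 2.0, 3: 4.0}
scores = score_conventional({"apple": 1.0, "orange": 1.0}, inverted_index, idf, doc_length)
```

The cost of this approach grows with the total length of the rows touched, which is the expense the remainder of the description sets out to avoid.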

One problem with this database design and query processing technique is that it does not reflect the relative
significance of terms in the database structure itself. Experience with large text databases has shown that terms that
most often appear in queries are typically the same terms that occur most frequently in the document collection itself.
Therefore, these terms typically have a large number of document/frequency tuples in the inverted index. Reading and
processing all these tuples in order to compute document scores is computationally expensive and time consuming.

Some conventional information retrieval systems use a different sort order in the inverted index to arrange the
documents corresponding to each term. However, this means that the documents cannot be efficiently stored using a
differential compression technique, which is one known method for decreasing the size of the inverted index. Differential
compression techniques are typically quite efficient on dense rows when those rows are sorted by increasing document
number; using a different sort order eliminates this benefit. Therefore the total bytes required to store the
document/frequency tuples for the given term increases dramatically. The increased size of the inverted index in turn has
a significant impact on the resources required to store and manage the database.
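
To make the compression argument concrete, the sketch below delta-encodes a row sorted by document number and counts the bytes needed to store the gaps. The text does not say which encoding is used; variable-byte coding (7 payload bits per byte) is assumed here only as one common choice, and all function names are hypothetical.

```python
def delta_encode(doc_ids):
    """Replace each document number by its gap from the previous one."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def delta_decode(gaps):
    """Recover the original document numbers by summing the gaps."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

def varbyte_size(n):
    """Bytes needed for a non-negative integer with 7 payload bits per byte."""
    size = 1
    while n >= 128:
        n >>= 7
        size += 1
    return size

row = [100000, 100003, 100010, 100012]   # hypothetical dense row, sorted
gaps = delta_encode(row)
compressed = sum(varbyte_size(g) for g in gaps)
uncompressed = sum(varbyte_size(d) for d in row)
```

With a sort order other than document number, the gaps are no longer small positive integers, so most of this saving disappears.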

Other conventional information retrieval systems cache the results of frequent queries so that the database and
inverted index do not need to be processed and scored when the query is already contained in the cache. However,
queries performed on a general purpose retrieval system with a very diverse document collection, as is typical on the
Internet, tend to exhibit little repetition. For example, on some existing Internet search systems, only 30% of the queries
occur more than ten times per day, and only 50% occur more than once, out of thousands of queries. Caching even the
30% of queries that repeat would only marginally improve performance, and require substantial memory resources.
Accordingly, it is desirable to provide a database structure and query processing technique that efficiently handles
queries in very large text databases, and accounts for the significance and repetitiveness of certain terms in the
queries, while still providing scalability as the document collection grows.

SUMMARY OF THE INVENTION 

The present invention overcomes the limitations of conventional information retrieval systems through the use of an
improved database organization and query execution process. Generally, a database in accordance with the invention
includes, in addition to the inverted index, a small persistent data structure that stores a static cache of "important"
documents for some (or all) of the terms in the inverted index. The static cache stores sufficient information about each
document to enable the retrieval system to quickly calculate a document score for the important documents without having
to traverse the inverted index in a conventional manner. This cache is consulted for each term in a query, and if possible
the query is completed and documents scored from the cache using the information contained therein. A significant
feature provided by the system is the ability to prune the search so that only relatively few documents must be scored
in order to obtain bounds on the scores of all the other documents in the collection. As a result, a relatively small number
of documents are scored and returned, while still guaranteeing that no unscored document is more relevant to the query
than those that were scored.

In one embodiment, the present invention comprises a contribution cache and an efficient mechanism for accessing
arbitrary documents in the inverted index. In one implementation discussed here, the latter mechanism is fulfilled by the
lookup table, but other methods could be used as well.

In an information retrieval system in accordance with the present invention, there is provided a database of
documents stored persistently in one or more computer readable memories, such as hard disk, optical disk, or the like. A
typical database used with the invention may have 500,000 or more documents, and may be distributed across various
computer systems. Each document is associated with a unique document identifier. For a query
containing a number of terms t, documents are scored by the system according to the following formula:

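$$S_D = \sum_{t} W_t \cdot C_t \qquad (2)$$

where S_D is the score of document D, W_t is the (normalized) weight of term t in the query, and C_t is the contribution
from term t to the overall score for the document D. Equation (2) is a re-expression of (1) above, where C_t is:

$$C_t = \frac{f_t \cdot IDF_t}{l_D} \qquad (3)$$

where f_t is the frequency of the term t in document D, IDF_t is the inverse document frequency of term t in the
database, and l_D is the normalization factor of equation (1).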
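
The relationship between equations (2) and (3) can be checked numerically with a short Python sketch. The function names and the toy values are assumptions for illustration only.

```python
def contribution(freq, idf, doc_length):
    """Equation (3): C_t = f_t * IDF_t / l_D."""
    return freq * idf / doc_length

def score_from_contributions(query_weights, contributions):
    """Equation (2): S_D = sum_t W_t * C_t, for a single document."""
    return sum(w * contributions[t] for t, w in query_weights.items())

# Hypothetical document with l_D = 4 containing "apple" once and "orange" four times.
contribs = {
    "apple": contribution(1, 1.0, 4.0),
    "orange": contribution(4, 2.0, 4.0),
}
score = score_from_contributions({"apple": 1.0, "orange": 1.0}, contribs)
```

Because C_t depends only on the document and the term, not on the query, it can be computed once ahead of time, which is what makes a static cache of contributions possible.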

In accordance with the present invention, the database is structured to include an inverted index, which may be
conventional. In addition to (or incorporated directly in) the inverted index, there is provided the static cache. The static
cache contains an entry for each term of the inverted index that has more than k (document, term frequency) tuples. k
may be set at any useful value, depending on the total number of documents in the database, and the distribution of
documents across terms. In most preferred embodiments, k is between 500 and 2000. In general, these are the
terms for which a query would typically require very extensive and time consuming processing in a conventional system
due to the number of documents that contain the term in the inverted index, which for common terms may be in the
tens of thousands, or more. This is because, as noted above, the documents in the inverted index are typically ordered
by some document identifier, and not by any measure of the significance of the term to the document or the database
as a whole. The present invention eliminates this defect with the static cache.

In one embodiment, each entry in the static cache includes a contribution cache and a lookup table. The
contribution cache contains a list of (document, contribution) tuples, where "contribution" is a measure of the contribution
the term makes to the document score of the document. The list may contain k tuples, using the same threshold number
as before. Alternatively, some other number of tuples may be stored, for example, based on a percentage of the
number of documents containing the term, or based on a threshold contribution value. The contribution may be
computed as described above in (3). However, other contribution formulas may also be used. The contribution need only be
a function of both the within-document frequency of the term (f_{D,t}) and a between-document frequency of the term, such
as IDF.

The contribution cache tuples are sorted by the contribution value. Documents are represented by any useful
identifier mechanism, such as their document number, pointer, or the like. As used herein, a document may be identified by
both indexing and referencing mechanisms, or combinations thereof. The term "document identifier" is used to mean
any such referencing value.
Since the contribution value is the greatest influence on the document score of each document, having the
documents ordered by contribution means that the documents to which the term most strongly contributes are first available
to the system for scoring and retrieval. This in turn provides for highly efficient and fast query processing.
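
The construction of the contribution cache described above can be sketched as follows. This is a minimal illustration under stated assumptions: the inverted index is a dict of (document number, frequency) rows, contributions follow equation (3), and exactly the top k tuples are kept for each term having more than k tuples. The name `build_contribution_cache` is hypothetical.

```python
def build_contribution_cache(inverted_index, idf, doc_length, k):
    """Build {term: [(doc_id, contribution), ...]} for terms with more than k tuples."""
    cache = {}
    for term, row in inverted_index.items():
        if len(row) <= k:
            continue  # infrequent terms are served from the inverted index alone
        tuples = [(doc_id, freq * idf[term] / doc_length[doc_id]) for doc_id, freq in row]
        # Highest contribution first; keep only the top k tuples.
        tuples.sort(key=lambda t: t[1], reverse=True)
        cache[term] = tuples[:k]
    return cache

inverted_index = {
    "apple": [(1, 2), (2, 1), (3, 6)],
    "kiwi": [(2, 1)],
}
idf = {"apple": 1.0, "kiwi": 3.0}
doc_length = {1: 2.0, 2: 4.0, 3: 3.0}
cache = build_contribution_cache(inverted_index, idf, doc_length, k=2)
```

A percentage-based or contribution-threshold cutoff, both mentioned above as alternatives, would only change the slicing step.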

The lookup table contains some number of pointers into the (document, term frequency) tuples in the inverted index
for the term. This allows random access to the frequency information for a specific document
without having to traverse the entire row of the inverted index to obtain the frequency information.

More particularly, in one embodiment the lookup table is a primary index to the inverted index. The (document,
term frequency) tuples in the inverted index are considered as arranged in blocks, each block having some number of
tuples, such as 100 tuples. The lookup table then contains the document identifier of, and pointer (or index) to, the first
tuple in each block. Thereby any (document, term frequency) tuple in the block may be easily accessed by first a binary
search, linear interpolation, or other search technique, into the lookup table given a document identifier, then a
reference into the block of the inverted index, and then a scan of the block.
Since the static cache is arranged by terms, it may be stored in the inverted index itself, or provided as a separate file
or table. Storage in a separate file provides the benefit of decreased search time due to improved locality.
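
The block-based lookup described above can be sketched in Python with a binary search over the first document identifier of each block, followed by a scan of one block. The in-memory list layout, the tiny block size, and the function names are assumptions for illustration; a real index would hold disk offsets rather than list positions.

```python
from bisect import bisect_right

def build_lookup_table(row, block_size):
    """Record (first doc_id, offset) for each block of a row sorted by doc_id."""
    return [(row[i][0], i) for i in range(0, len(row), block_size)]

def find_frequency(row, lookup, block_size, doc_id):
    """Return the term frequency for doc_id, or None if the document lacks the term."""
    first_ids = [first for first, _ in lookup]
    b = bisect_right(first_ids, doc_id) - 1   # last block starting at or before doc_id
    if b < 0:
        return None
    start = lookup[b][1]
    for d, freq in row[start:start + block_size]:  # scan a single block only
        if d == doc_id:
            return freq
        if d > doc_id:
            break
    return None

row = [(2, 1), (5, 3), (9, 2), (14, 1), (20, 4), (31, 2), (40, 1)]
lookup = build_lookup_table(row, block_size=3)
```

Here `find_frequency(row, lookup, 3, 20)` touches only the second block instead of the whole row.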

As an optional structure in the static cache, there may be provided a cache index for each term entry. The cache
index is a list of indices (0 to the number of tuples) into the contribution cache. Whereas the contribution cache is
ordered by the contribution value, the cache index for a term is ordered by the document identifiers. The cache index
allows for the rapid determination of whether a given document is found in the contribution cache of a term.
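
A cache index of this kind can be sketched as a permutation of positions in the contribution cache, sorted by document identifier and searched with a binary search. The names and sample values below are hypothetical, and for brevity the sketch materializes the list of identifiers on each lookup, which a production implementation would avoid.

```python
from bisect import bisect_left

def build_cache_index(contribution_cache):
    """Indices into the contribution cache, ordered by document identifier."""
    return sorted(range(len(contribution_cache)), key=lambda i: contribution_cache[i][0])

def cached_contribution(contribution_cache, cache_index, doc_id):
    """Binary-search the cache index; return the contribution or None if absent."""
    ids = [contribution_cache[i][0] for i in cache_index]
    pos = bisect_left(ids, doc_id)
    if pos < len(ids) and ids[pos] == doc_id:
        return contribution_cache[cache_index[pos]][1]
    return None

# Contribution cache sorted by decreasing contribution (hypothetical values).
apple_cache = [(30, 0.9), (7, 0.6), (12, 0.4)]
apple_index = build_cache_index(apple_cache)
```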

In conjunction with the improved organization of the database as described, the present invention provides
improved methods for processing queries. There are two main cases for handling queries: single-term queries and
multiple-term queries.

For single-term queries, only the contribution cache needs to be searched. Since there is only one term in the
query, for a given document its entire document score will be a function of that term's contribution, as shown in (2)
above. Since documents are already sorted by decreasing contribution in the contribution cache, a first subset of
documents, such as the first 10, in the contribution cache can simply be returned as the results of the query, either with or
without computing the document score. This provides a significant performance advantage over conventional systems
which must traverse the entire inverted index to score the documents therein.

If the term is not present in the contribution cache, then conventional scoring routines may be used.
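
The single-term case reduces to slicing the front of the term's contribution cache. A minimal sketch, assuming the cache is a dict of rows already sorted by decreasing contribution; the function name and the `None` fallback signal are illustrative choices.

```python
def single_term_query(term, contribution_cache, n=10):
    """Return the first n document identifiers for a single-term query.

    Returns None when the term has no contribution cache, in which case the
    caller falls back to conventional scoring over the inverted index.
    """
    row = contribution_cache.get(term)
    if row is None:
        return None
    return [doc_id for doc_id, _ in row[:n]]

cache = {"apple": [(30, 0.9), (7, 0.6), (12, 0.4)]}
top_two = single_term_query("apple", cache, n=2)
```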

The implementation for multiple-term queries is only slightly more complex. Here, a parallel unpack method is used.
Each of the term rows in the contribution cache is traversed in parallel, and a document score is determined for the lowest
matching document in all of the rows.

While traversing the terms of the query, there will almost always be documents that are present
in the contribution cache for one term, but not for another. That is, the document will appear in less than all of the query
term rows of the contribution cache, and likely in only one such row. For example, if the query is "apple and orange",
there will be an arbitrary document, say document number 1000, that appears in the contribution cache of "apple" but
does not appear in the contribution cache of "orange." The present invention provides several mechanisms for
determining whether this condition exists and for completing the query.

In one embodiment, since a document identifier is already known, the lookup table for "orange" is searched to find
the pointer into the inverted index row of "orange" for the block containing document number 1000. (As explained above,
the inverted index rows are sorted by document number. The pointer [...] and fixed size fields, or can be searched in
various manners.) Once referenced into the correct block in the
inverted index, only a limited number of entries therein need be expanded until the desired document number 1000 is
reached. The document score for this document is then updated from the term frequency information in the inverted
index at this point using equation (1) set forth above. This process of referencing the lookup table and traversing a
limited portion of the term rows in the inverted index is repeated for each term for which the document is
not found in the contribution cache of the term. As long as the lookup table is properly constructed to allow
relatively efficient random access to (document, term frequency) information, the information retrieval system will have to
unpack far fewer (document, term frequency) tuples in the inverted index than it would have unpacked if it had iterated over
the entire row of the inverted index as in a conventional information retrieval system.

Where the optional cache index is used, this problem may be handled even more efficiently by using the document
identifier to search the cache index of a query term. Retrieved cache indices are mapped to the underlying document
identifier in the contribution cache. During query processing, the contribution cache is searched using the cache
index, matching the document identifiers against the given document identifier. If there is a match, then the
contribution for the term can be immediately accessed from the contribution cache, without incurring the I/O expense
associated with using the lookup tables and the inverted index. This approach substantially increases performance.

However, multiple-term queries introduce another problem. It is theoretically possible for documents that are not
present in any of the contribution caches to have higher scores than some documents in the caches: the uncached
document could contain low-contribution terms, but enough of such terms so that their sum exceeds the scores of cached
documents (which may contain only a few high-contribution terms each).

Fortunately, the contribution caches of the present invention entirely solve this problem by identifying a maximum
bound on the scores of otherwise unscored documents. As a query is processed, and the term rows of the contribution
cache are traversed for the terms of the query, an upper bound document score value is maintained as the combined
contributions of the documents (regardless of their document identifier) at a furthest advanced index for the contribution
cache rows. Because the contribution caches are sorted in order of contribution value, all unscored documents that
remain in the contribution caches will have, at any time, a document score less than the defined upper bound document
score. Further, no document that is not in a contribution cache can have a higher document score than this upper bound,
because it is already known that the contribution of the terms of the query to the document was so low as to not include
the document in the contribution cache of one or more of the terms.

Accordingly, while the documents in the contribution cache are being scored, a set of n document scores is
maintained, such as the top 20 or 100 document scores, or any number as desired, as the current search results. A minimum
document score from this set is maintained, and updated as new documents are scored. Each time the minimum
document score is updated, it is compared with the upper bound document score. As long as the minimum document score
from the current search results is less than the upper bound, the documents in the contribution cache are scored. Once
the minimum document score in the result set is greater than or equal to the upper bound, the document scoring is
halted. The search results are guaranteed to include the n highest scoring documents for the query, even though many,
perhaps tens of thousands, of documents containing some terms of the query have not even been scored. No other
known information retrieval system can guarantee this type of result.
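
The pruning rule can be sketched as follows. This is a simplified illustration, not the patented procedure itself: it assumes a round-robin traversal of the cache rows, non-empty caches that hold the highest contributions for each term, and a dict `all_contribs` that stands in for the lookup-table and inverted-index path used to complete a document's score. All names are hypothetical.

```python
import heapq

def top_n_with_bound(query_weights, caches, all_contribs, n):
    """Traverse contribution caches in parallel; stop once the n-th best score
    is at least the upper bound on the score of every unscored document."""
    terms = list(query_weights)
    pos = {t: 0 for t in terms}
    scored = {}
    top = []  # min-heap of (score, doc_id) holding at most n entries
    while True:
        # Upper bound: weighted sum of the contributions at the furthest
        # advanced index of each row (the last entry if a row is exhausted).
        bound = sum(
            query_weights[t] * caches[t][min(pos[t], len(caches[t]) - 1)][1]
            for t in terms
        )
        if len(top) == n and top[0][0] >= bound:
            return sorted(top, reverse=True), True    # result is guaranteed
        if all(pos[t] >= len(caches[t]) for t in terms):
            return sorted(top, reverse=True), False   # caches ran out first
        for t in terms:
            if pos[t] >= len(caches[t]):
                continue
            doc_id = caches[t][pos[t]][0]
            pos[t] += 1
            if doc_id in scored:
                continue
            score = sum(query_weights[u] * all_contribs[u].get(doc_id, 0.0) for u in terms)
            scored[doc_id] = score
            if len(top) < n:
                heapq.heappush(top, (score, doc_id))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, doc_id))

all_contribs = {
    "apple": {1: 0.9, 2: 0.8, 3: 0.2, 4: 0.1, 5: 0.05},
    "orange": {2: 0.7, 6: 0.6, 1: 0.3, 4: 0.1, 7: 0.05},
}
caches = {
    t: sorted(c.items(), key=lambda kv: kv[1], reverse=True)[:4]
    for t, c in all_contribs.items()
}
results, guaranteed = top_n_with_bound({"apple": 1.0, "orange": 1.0}, caches, all_contribs, n=2)
```

In this toy run only three of the seven documents are scored before the bound drops below the current minimum, at which point the top two results are final.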

The use of the minimum document score also provides another optional technique for handling documents
appearing only in the contribution cache of some terms, and not others. When a document is not present in the contribution
cache of a given query term, the document may be stored together with a maximum document score
for the document. Since the document does not appear in the contribution cache of the term, its maximum
document score can be determined by the lowest contribution value in the contribution cache of the term. Once stored,
document retrieval continues with other documents in the contribution caches. The minimum document score of the search
results can be compared with the maximum document scores of the stored documents. Only those documents having
a maximum document score greater than the minimum document score of the search results need to be further
processed and completely scored. In many cases, none of the stored documents need to be scored at all. In either case,
significant time and computational efficiency may be achieved.

The present invention provides considerable performance advantages over conventional information retrieval
systems. In experiments on a large document database, improvements in retrieval speed and throughput by a factor of 20
or 30 have been experienced, and improvements by a factor of 100 are not uncommon.

The contribution-cache and lookup-table sizes can be tuned to balance retrieval performance against memory
requirements. Because the auxiliary structures are typically smaller than the original document/frequency tuples, they
can be stored in less main memory. This means that smaller machines can still perform efficient queries on large
databases.

Various aspects of the present invention are capable of different embodiments. These include, for example, the
particular structural arrangements for a database of documents to include the contribution cache, the random access
mechanism and its variants; methods for preprocessing documents in the database to create the contribution caches;
methods for processing [...] the structural
arrangements of the static caches, and useful query processing methods.

The present invention is applicable to a variety of compression schemes within both the primary
document/frequency data structure and the cache and lookup tables themselves. The invention can be extended to handle boolean
constraints or other filters. While designed for very large databases, the invention also provides noticeable performance
improvements even on relatively small collections.

In addition, while the above scoring function (2), (3) is particularly useful, many other different scoring functions and
variations thereof may be used with the present invention, which is independent of any particular scoring function. For
example, the invention can easily be extended to handle inverse cosine scaling. Also, the present invention may be
usefully extended to information retrieval systems that treat multiple word phrases as single terms, providing an inverted
index entry and static cache for such terms, and allowing searching on the phrases. In such systems, contribution
caches are created for the phrase terms, with the appropriate document identifier and term/phrase frequency
information.
BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is an illustration of an information retrieval system in accordance with the present invention.
Figure 2 is an illustration of the software architecture of the database computer and document database in
accordance with the present invention.
Figure 3a is an illustration of the static cache including the contribution cache, lookup tables, and cache index.
Figure 3b illustrates an example of the cache index.
Figures 4a through 4e illustrate an example of the use of the static cache during query processing in accordance
with the present invention.
Figures 5a, 5b, and 5c are flowcharts of various methods of query processing in accordance with the present
invention.

DETAILED DESCRIPTION OF THE INVENTION 

System Architecture

Referring now to Figure 1, there is shown the architecture of one embodiment of a system in accordance with the
present invention. In system 100, there is at least one client computer 101 (or "client"), and a database computer 102,
communicatively coupled over a network, preferably the Internet or other similar wide area networks, or even local area
networks, as the actual network architecture is immaterial to the invention. The database computer 102 is
coupled to and controls a document database 103.

In this embodiment, a client computer 101 is of conventional design, and includes a processor, an addressable
memory, a display, a local hard disk (though diskless computers may also be suitably used), input/output ports, and a
network interface. The display is of conventional design, preferably color bitmapped, and provides output for a graphical
user interface for software applications thereon. The input/output ports support input devices, such as a keyboard,
mouse, and the like, for inputting commands and data. The client computer 101 executes a conventional operating
system. A conventional network interface to a network provides access to remotely situated mass storage devices, along
with access to the Internet, with a TCP-IP type connection, or to other network embodiments, such as a WAN, LAN,
MAN, or the like. In the preferred embodiment the client computer 101 may be implemented on an Intel-based computer
operating under the Microsoft Windows® operating system, or equivalent devices. A client computer 101 executes some
form of client application that interfaces with the database computer 102 to request and receive documents therefrom,
and display such documents to the user.

A client computer 101 executes some form of client application that interfaces with the database computer 102 to
provide user queries thereto and receive documents satisfying such queries therefrom, and display such documents to
the user. In the preferred embodiment where the database computer 102 is accessed over the Internet or World Wide
Web, the client application is adapted for communication via the Hypertext Transfer Protocol, and further adapted for
decoding and displaying HTML documents. The client application may be a database frontend, a World Wide Web
browser, or other similar application, executing conventionally in the local memory of the client computer 101. It is
anticipated that in a preferred embodiment, the client computer 101 may be a personal computer as used by an end user,
whether in their home or place of employment, in order to access documents and information stored in the document
databases 103 distributed on the Internet or other network.

In terms of hardware architecture, the database computers 102 are conventional server type computers, preferably
supporting a relatively large number of multiple clients simultaneously for handling search and document requests, and
other processing operations. The database computers 102 may be implemented with Intel-based personal computers,
or other more powerful processors, such as various models of Sun Microsystems' SparcStations, operating under their
UNIX implementation. The database computers 102 provide one or more conventional processors, and a suitable
amount of RAM, preferably on the order of 18-64 Mb.

Referring to Figure 2, in terms of software architecture, in accordance with the present invention, each database
computer 102 comprises a database management system 104, having a DDL 110 and DML 109 for defining and
manipulating the document database 103. In addition, the database management system 104 is adapted in accordance
with the present invention to provide an application programming interface 107 with preprocess, update, and retrieve
methods. The preprocess method creates the static cache for an existing database 103 of documents. The update
method updates the static cache as new documents are added to the database 103. The retrieve method provides for
query processing adapted to the static cache of the present invention. Client applications 122, of whatever type, hold
the necessary interfaces to the database management system 104, typically for invoking the retrieve method. The
database computer 102 further includes a network communication protocol 111 for handling communication with multiple
client computers 101. A conventional operating system 112 is also present. These software elements operate
conventionally in the addressable memory 105 of the database computer 102. Those of skill in the art will appreciate
that the database management system 104 with an application programming interface 107 supporting the present
invention may be provided to the database computer 102 as a software product on a computer readable medium, such
as CD-ROM, 8mm magnetic tape, or the like, for installation and execution thereon.

The document database 103 coupled to a database computer 102 may have any useful internal architecture or
schema. The document database 103 preferably accommodates anywhere from several thousand to in excess of 1
million documents, providing persistent storage thereof in a set of document text files 115. The document database 103 is
preferably relational, though object oriented or network database architectures may also be used. Support for
conventional structured query languages is also preferred.

The database 103 may employ any of a variety of document representation techniques. Generally, document
representation in accordance with the present invention includes the use of document vectors. A document vector may be
constructed as a set of (term, term frequency) tuples. In addition, a document is associated with its full text in the text
files 115. In some embodiments, the actual document vector may be created and stored for the document. In other
embodiments, the document vector may be created as needed during the execution of a query.
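
A document vector of (term, term frequency) tuples can be sketched in a few lines of Python. Lowercasing and whitespace tokenization are simplifying assumptions made here, and the Euclidean length is shown only as one possible normalization factor l_D; the function names are hypothetical.

```python
from collections import Counter
import math

def document_vector(text):
    """Build a document vector as a sorted list of (term, term frequency) tuples."""
    counts = Counter(text.lower().split())
    return sorted(counts.items())

def vector_length(vector):
    """Euclidean length of the vector, one possible normalization factor l_D."""
    return math.sqrt(sum(freq * freq for _, freq in vector))

vec = document_vector("apple orange apple pie")
```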

In one preferred embodiment, the document database 103 persistently maintains an inverted index 200, a static
cache 215, and a normalization table 217. These elements are preferably persistently stored in the storage media of
the database 103, such as hard disk, optical disk, or the like. During operation, all or selected portions of these
elements may be usefully copied into the addressable memory 105 of the database computer 102 for improved
performance. In particular, the static cache 215 is copied to memory for high speed query processing.

Figure 3a illustrates elements of one embodiment of an inverted index 200 and static cache 215 for use with the
present invention. Other more complicated inverted indices supporting compression, and other attributes of the
documents, may be used. Inverted index 200 includes an ordered (typically alphabetically) table of terms 201, each term being
one of the unique terms in the database 103. Multiple word phrases, such as "intellectual property", may also be
included as individual terms in the inverted index 200, to allow for phrase searching. Each term 201 is associated with
at least one, typically many, (document, term frequency) tuples 204. The document is uniquely identified by some
identifier, using an identification scheme, and not the full text of the document itself, which is stored in the text files 115. The
term frequency describes the number of occurrences of the term in that document. The (document, term frequency)
tuples 204 are ordered by the identifiers of the documents. In an embodiment without any compression techniques
applied, each tuple requires [...]
2 bytes for the frequency, sufficient for up to 65,536 occurrences of a term in a document. Differential compression may
be used to reduce these memory requirements. Those of skill in the art will appreciate that other information may also
be stored in the tuples of the inverted index 200.

In accordance with one embodiment of the present invention to support one version of the lookup table, the (document, term frequency) tuples 204 are grouped into blocks 205. There are p such blocks 205 (block 1 through block p). Each block 205 contains some number z of the tuples 204. The number is preferably predetermined and fixed. In one embodiment, each block has about 1000 tuples 204. Alternatively, variable sized blocks may be used.
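
By way of illustration only, the term rows and fixed-size blocks described above might be sketched as follows. This is a minimal sketch, not the implementation of the invention: the function names `build_inverted_index` and `split_into_blocks` and the in-memory dictionary layout are assumptions introduced here for the example.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Build term rows from docs, a dict of doc_id -> list of terms.

    Returns a dict of term -> [(doc_id, term_frequency), ...], with each
    row ordered by document identifier. (Hypothetical helper.)
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):  # visiting ids in order keeps rows sorted
        for term, tf in Counter(docs[doc_id]).items():
            index[term].append((doc_id, tf))
    return dict(index)


def split_into_blocks(postings, block_size=1000):
    """Group a term row into blocks of block_size tuples (last may be short)."""
    return [postings[i:i + block_size]
            for i in range(0, len(postings), block_size)]
```

Fixed-size blocks keep the number p of blocks per row easy to compute; variable sized blocks would need the block boundaries stored explicitly.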

In addition, each term 201 in the inverted index 200 has an associated inverse document frequency (IDF) 203 that describes the relative significance of the term in the database 103. The IDF 203 of a term 201 may be computed in any of a variety of manners, the particular implementation of which is not limited by the present invention. One simple and useful definition of the IDF of a term T is:

$$ \mathrm{IDF}_T = \log \frac{N}{n_T} \qquad (4) $$

where N is the number of documents in the database 103, and n_T is the number of documents in the database 103 that contain at least one occurrence of term T. Other more complex definitions of IDF may be used with the present invention.
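
As a worked illustration of this simple IDF definition, a minimal sketch follows. The function name `idf` is hypothetical, and the use of the natural logarithm is an assumption, since the base of the logarithm is not stated here.

```python
import math


def idf(num_docs, docs_with_term):
    """Inverse document frequency log(N / n_T); natural log is assumed."""
    if docs_with_term <= 0 or docs_with_term > num_docs:
        raise ValueError("need 0 < n_T <= N")
    return math.log(num_docs / docs_with_term)
```

Under this definition a term that appears in every document has an IDF of zero, and rarer terms receive progressively larger values.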

In accordance with the present invention, database 103 further includes a static cache 215. The static cache 215 is ordered by a set of terms 207. These terms are selected from the inverted index 200 as those for which the total number of (document, term frequency) tuples 204 exceeds a predetermined threshold. The threshold may be established with respect to the total number of documents in the database, or other criteria. In the preferred embodiment, thresholds between 500 and 2000 are found to be useful. The threshold value is called k. Thus, the set of terms 207 in the static cache 215 is a subset of the set of terms 201 in the inverted index 200.

For each of these terms 207, there is a contribution cache 209 of (document, contribution) tuples 208. In a preferred embodiment n tuples are stored. The number n may be determined in a variety of manners, not limited by the invention. In one embodiment, n is equal to k. Alternatively, n may be a percentage of k, based on the number of documents in the database. Also, n may vary between contribution caches 209. For example, the number of tuples may dynamically result from the use of a contribution threshold value, such that tuples 208 are created only for those documents for which the contribution of the term 207 exceeds a predetermined threshold. This limits the tuples to including only a certain range of documents with a known degree of relevance to the term. For convenience of notation, the examples herein will assume that n=k tuples are stored, but it is understood that n can be determined in any of the foregoing, or other, manners. Also, "document" here is understood by those of skill in the art to mean that some identification scheme is used to identify a document, and that the text of the document itself is not stored in the tuple.
The tuples 208 are ordered by the descending value of the contribution. The contribution is the contribution of the term to a document score for the identified document. The contribution is preferably computed as shown above in (3) and reproduced here:

$$ c_t = \frac{f_{D,t} \cdot \mathrm{IDF}_t}{l_D} \qquad (3) $$



The static cache 215 further includes a lookup table 214 for each term 207. The lookup table 214 includes a number of (document, pointer) tuples 213. Where there are p blocks 205 in the inverted index, there are p (document, pointer) tuples 213. The document is the document identifier of the first (document, term frequency) tuple 204 in the corresponding block 205 for the term in the inverted index, and the pointer references the memory location of the block, or alternatively indexes the offset of the block in the row. In either case, the lookup table 214 enables the database management system 104 to easily access the tuples 204 in the block 205 for a given document in order to score the document during query processing. Typically, this is done by a binary search, or other technique, through the lookup table 214 for the corresponding block containing the document [...] inverted index 200. The block is traversed until the given document identifier is located. A document score based on the term is then computed from the term frequency information in the (document, term frequency) tuple 204.
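
The lookup just described, a binary search over the first document identifier of each block followed by a forward traversal of that block, might be sketched as follows. This is an illustrative sketch only: the names `build_lookup_table` and `term_frequency` are hypothetical, the pointer is modeled as a list offset into the term row, and a return value of 0 for an absent document is an assumption.

```python
from bisect import bisect_right


def build_lookup_table(postings, block_size):
    """Return (first_ids, offsets): one entry per block of the term row."""
    offsets = list(range(0, len(postings), block_size))
    first_ids = [postings[o][0] for o in offsets]
    return first_ids, offsets


def term_frequency(postings, lookup, block_size, doc_id):
    """Find doc_id's term frequency via the lookup table; 0 if absent."""
    first_ids, offsets = lookup
    b = bisect_right(first_ids, doc_id) - 1  # last block starting <= doc_id
    if b < 0:
        return 0
    start = offsets[b]
    for d, tf in postings[start:start + block_size]:
        if d == doc_id:
            return tf
        if d > doc_id:  # rows are ordered by id, so the search can stop
            break
    return 0
```

Only one block is traversed per lookup, so the cost is bounded by the block size rather than by the length of the term row.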

When applied to an existing database for the first time, the preprocess method of the database management system 104 is used to create the static cache 215. In a simple embodiment, the preprocess method creates each contribution cache 209 by selecting those term rows of the inverted index 200 that have more than k (document, term frequency) tuples 204. For each selected term, the preprocess method creates a contribution cache 209 entry and stores the term. The preprocess method then traverses the documents in the term row to determine the contribution of the term to the document according to its frequency in the document. For computational efficiency the documents are ranked by contribution as the contributions are being determined. [...] such as AVL trees, splay trees, or the like, for storing the n highest ranking documents by contribution. Once all documents in the term row are processed, the n tuples 208 are stored from this data. Alternatively, all of the contribution values for a term may be determined first, ranked, and then the n documents with the highest contribution values selected.
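
The selection of the n highest contributing documents for a term might be sketched as below. This is a sketch under stated assumptions: a binary heap (Python's `heapq`) is swapped in for the balanced trees mentioned above, the function name `build_contribution_cache` is hypothetical, and the contribution is supplied by the caller as a function so that no particular formula is presumed.

```python
import heapq


def build_contribution_cache(postings, contribution, n):
    """Return the n (doc_id, contribution) tuples with the largest
    contributions, ordered by descending contribution.

    postings     -- term row: [(doc_id, term_frequency), ...]
    contribution -- caller-supplied function (doc_id, tf) -> float
    """
    scored = ((contribution(doc_id, tf), doc_id) for doc_id, tf in postings)
    top = heapq.nlargest(n, scored)  # keeps only n candidates at a time
    return [(doc_id, c) for c, doc_id in top]
```

Because only n candidates are retained while the row is scanned, the memory needed does not grow with the length of the term row.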

The preprocess method then creates the lookup table 214 for the term by traversing the blocks of the term row in inverted index 200 and storing the (document, pointer) pair 213 information for the appropriate blocks. Again, this process may be done in line while scanning the term row for computing term contributions.

The contribution cache 209 for each term 207 is preferably updated in conjunction with the inverted index 200 when a document is initially processed and entered into the document database 103. This may be achieved with the update method, or similar operations. Generally, the document text is iterated over, the unique terms in the document identified, along with their frequencies. The IDFs of terms, and the (document, term frequency) tuples 204 are then updated in the inverted index 200, as are the contributions of each of the terms to the document using (3). The (document, contribution) tuples 208 are then updated, by re-ordering the tuples 208 on the basis of the new contributions for terms in the document, discarding tuples 208 if necessary so that n tuples always remain. Any changes in the term row for the inverted index 200 that affect the block orderings are updated to the lookup table 214 of the term in the static cache 215.
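
One way the re-ordering of a contribution cache on insertion of a new document could look is sketched below. It is a minimal sketch, assuming the document is not already present in the cache and that the cache is held as a Python list in descending order of contribution; `update_cache` is a hypothetical name.

```python
from bisect import bisect_right


def update_cache(cache, doc_id, contribution, n):
    """Insert (doc_id, contribution) into a cache kept in descending order
    of contribution, then discard entries beyond the n-th.

    Assumes doc_id is not already in the cache (hypothetical simplification).
    """
    # Negating the contributions gives an ascending key list for bisect.
    keys = [-c for _, c in cache]
    pos = bisect_right(keys, -contribution)
    cache.insert(pos, (doc_id, contribution))
    del cache[n:]  # keep at most n tuples
```

A document whose contribution is lower than every cached entry is inserted last and immediately discarded when the cache is already full, leaving the cache unchanged.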

As a further optional enhancement to the database structures illustrated in Figure 3a, a cache index 223 may be associated with each contribution cache 209. The cache index 223 contains n index entries, each entry storing an index to one of the tuples 208 in the contribution cache 209. These entries are ordered not by the value of the index, or the contribution, but rather, by the values of the document identifier in the respective tuples 208. Figure 3b illustrates a simple example, with a contribution cache 209 having 5 tuples 208, and the accompanying cache index 223. The entries 223-1 to 223-5 include the index values to tuples 208-1 through 208-5, and are ordered by the document identifiers in the tuples. This cache index 223 is optionally consulted during query processing, by looking up the document identifier for a particular index entry, and matching it with a previously determined document identifier. This enables very rapid determination of whether a document is present in the contribution cache of a given term, without having to incur the input/output overhead and time delay associated with using the lookup tables 214 and inverted index.
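
A cache index of this kind, and a binary search through it, might be sketched as follows. The sketch is illustrative only: the names `build_cache_index` and `find_contribution` are hypothetical, the five tuples in the usage lines are made-up values rather than those of Figure 3b, and returning `None` for an absent document is an assumption.

```python
def build_cache_index(cache):
    """Return positions into the contribution cache, ordered by the
    document identifier of the tuple at each position."""
    return sorted(range(len(cache)), key=lambda i: cache[i][0])


def find_contribution(cache, cache_index, doc_id):
    """Binary search the cache index for doc_id; None if not cached."""
    lo, hi = 0, len(cache_index)
    while lo < hi:
        mid = (lo + hi) // 2
        found_id, contribution = cache[cache_index[mid]]
        if found_id == doc_id:
            return contribution
        if found_id < doc_id:
            lo = mid + 1
        else:
            hi = mid
    return None


# Hypothetical five-entry cache, ordered by descending contribution.
example_cache = [(63, 0.98), (3, 0.80), (12, 0.75), (77, 0.50), (25, 0.40)]
example_index = build_cache_index(example_cache)
```

The contribution cache itself stays ordered by contribution; only the small index is ordered by document identifier, so both access patterns are served without duplicating the tuples.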

Those of skill in the art will appreciate that the structures of Figure 3a and 3b are merely descriptive of one embod- 
iment useful for explaining and practicing the present invention, and that many other variations and implementations 
may be used to achieve the results and benefits of the invention. 
Query Processing

Referring now to Figures 4a through 4d, there are shown illustrations of the contribution caches 209 and lookup tables 214 of the static cache 215 for explaining the processing of queries using the database structures in accordance with the present invention. In these figures, there are three rows, each representing a term in a user's query, the query being "apple orange banana". For each term there is shown a portion of the contribution cache 209 having the (document, contribution) tuples 208. For example, the first tuple 208 in the contribution cache 209 of "apple" is document 63. Being the first document, its contribution c is the highest at .98, meaning that the term "apple" contributes to the document score of this document more than it contributes to the document score of any other document. Note that the scaling of the contributions here is arbitrary. Also, as described above, there would be anywhere from about 500 to 2,000 or more entries in each of the contribution caches 209. The cache indices 223 are not shown in these figures.

For each term there is shown a portion of the lookup table 214 for the term, here indexed to blocks of 50 entries in the inverted index 200. Each lookup table 214 has (document, pointer) tuples 213. Note that the pointer values are merely indicated by the letter "P", and again, would point to, or index to, different portions or offsets in the respective term rows of the inverted index 200. As described above, there would be p entries in the lookup table 214, where p is the number of blocks in each row of the inverted index 200.

Also illustrated is the result set 301 that stores a limited number of documents sorted by document score. The size of the result set 301 may be determined on demand when the user submits the query. Typically, the result set 301 includes the top 20 to 100 documents located in response to a query.

Initially the result set 301 is empty, and new documents and document scores are added until the limit is reached. The document score of the last entry in the result set 301 is usefully stored in the minimum document score 302; alternatively, it may be directly accessed as needed from the document score of the last entry in the set 301. As more new documents are scored, if their document score is below the minimum document score 302, then the score is discarded, and the document is not added to the result set 301.

Finally, an upper bound document score 303 is also maintained.

A query processing method increments a cache counter 305 over each column of the (document, contribution) tuples 208 listed in contribution caches 209, in order to score the documents identified in the contribution caches 209 of each of the query terms in a parallel manner. The score of a document is based on the contribution of a query term to the document as found in the contribution cache 209 for the term, and on the term frequencies of the other query terms in the document. These term frequencies may be found by looking up the block that "contains" the document in the lookup table of another query term, and obtaining the frequency information from the term row of that other term in the inverted index. Alternatively, a document score may be determined by using the cache index 223 of a query term to search the contribution cache 209 of the term for the document, and using the term contribution therein to compute the document score. As documents are scored the results are placed in the result set 301 if the document score is greater than the minimum document score 302, and the minimum document score 302 is updated from this as well.

The process is terminated whenever the minimum document score 302 is greater than the upper bound document score 303. The upper bound document score 303 is the sum of the contributions at the current cache counter 305 across the contribution caches 209 of all the query terms. More precisely, for a cache counter i, and contribution caches j for T query terms, the upper bound document score U_i is:

$$ U_i = \sum_{j=1}^{T} c_{i,j} $$

The process can terminate on this basis because no unscored document that is in any contribution cache at a location greater than the current cache counter 305 (or not even present in the contribution caches at all) can have a document score that is greater than the upper bound document score 303. Thus, when the upper bound document score 303 becomes less than the minimum document score 302 in the result set 301, there is no further need to score documents. The result set can therefore be returned to the user.
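
The traversal and termination test described above might be sketched as follows. This is a simplified sketch and not the retrieve method of the invention: the names `upper_bound` and `retrieve` are hypothetical, full document scoring is delegated to a caller-supplied function `score_document`, the result set is modeled as a heap, and the minimum document score is taken as the lowest score currently in the result set (0 while it is empty).

```python
import heapq


def upper_bound(caches, i):
    """Sum of the i-th contributions across all query-term caches."""
    return sum(cache[i][1] for cache in caches)


def retrieve(caches, score_document, limit):
    """Score documents column by column until the upper bound falls
    below the minimum score in the result set.

    caches         -- one [(doc_id, contribution), ...] list per query
                      term, each in descending order of contribution
    score_document -- caller-supplied function doc_id -> full score
    limit          -- maximum number of results to keep
    """
    top = []        # min-heap of (score, doc_id); top[0] is the minimum
    scored = set()  # documents already scored in an earlier column
    depth = min(len(cache) for cache in caches)
    for i in range(depth):
        min_score = top[0][0] if top else 0.0
        if upper_bound(caches, i) < min_score:
            break   # no unscored document can enter the result set
        for cache in caches:
            doc_id = cache[i][0]
            if doc_id in scored:
                continue
            scored.add(doc_id)
            entry = (score_document(doc_id), doc_id)
            if len(top) < limit:
                heapq.heappush(top, entry)
            elif entry > top[0]:
                heapq.heapreplace(top, entry)
    return sorted(top, reverse=True)
```

If the loop runs through every column without the bound dropping below the minimum, this sketch simply returns what it has; the handling of that case is a separate matter.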

Figures 5a, 5b, and 5c illustrate flowcharts for various methods of query processing in accordance with the present invention. These methods are managed by the database management system 104, typically by an implementation of a retrieve() method provided in the application programming interface of the database management system 104. Figures 4b through 4d will be used to explain these operations by way of an example.

Referring now to Figure 5a, the minimum document score 302 is initialized to 0, and the result set 301 is also initialized 501.

The retrieve() method then begins traversing the contribution caches 209 by incrementing 503 a cache counter 305 over the number n of (document, contribution) tuples 208 in a contribution cache 209. The initial position of the cache counter 305 is shown in Figure 4b with the cache counter 305 covering the first entries in the contribution caches 209.

The upper bound document score 303 is set 505 to the combined value of the contributions of the i-th (document, contribution) tuples 208 in the contribution caches 209 of all the query terms, as per (4). In Figure 4b, the upper bound document score 303 is 2.42, equal to the summed contributions of all of the first contribution cache tuples for the query terms. As cache counter i is incremented, this value will change, typically dropping slightly with every increment.

The upper bound document score 303 is compared 507 with the minimum document score 302. Since it is greater 
(2.42 > 0) the process continues. 

Now the query terms are iterated over 509, for each query term j, from 1 to T, the total number of query terms.

The identifier of the current document is retrieved 511 from the i-th entry in the contribution cache 209 of query term j (j[i].doc_id); this becomes the current_doc_id. This is the identifier of the document that is to be scored. In Figure 4b, this would be document 63, from the first (document, contribution) tuple 208 for the first query term "apple", thus setting the current_doc_id to 63.

The document score for document 63 is initialized 513 using the contribution (here .98) in a scoring function, for example as set forth in (2):

$$ S_D = \sum_{t=1}^{T} w_t c_t \qquad (2) $$

This establishes the first component of document 63's score, that contributed by the query term "apple."

Now the score of document 63 is updated with respect to the other query terms, "banana" and "orange," in one of various implementations. In one approach, the retrieve method scores each of the query terms on the current document by searching 518 the lookup tables 214 of the query terms, and accessing the term frequency information from the inverted index entry for the document and the term. This technique is illustrated in Figure 5b as operations 519-523, and in Figures 4b-4e. In a second approach, the cache index 223 and contribution cache 209 of the query term are searched 517 to identify the current document in the contribution cache 209. The document is then scored from the term contribution. This technique is illustrated in Figure 5c. Depending on the results of the cache index search 517, various other operations may be undertaken, as further described below.

The lookup table search 518 is discussed first. Here, the retrieve() method loops 515 over these other terms using loop variable k. For "banana", the current_doc_id, here 63 for the first document, is found 519 in the lookup table 214, as being in the block beginning with document 50. This finding operation may be done by linear search, binary search, linear interpolation, or other efficient means, the particular implementation not being limited by the present invention. However the entry is found, the current_pointer is taken from this entry, here P. In the example of Figure 4b, for document 63, this pointer references the memory location in the inverted index 200 for the 50th block 205 for the term "banana." Beginning from this block and traversing 521 forward, there will be found (since these entries are ordered by document identifier) the (document identifier, term frequency) tuple for document 63. The frequency of the term "banana" is obtained 523 and the document score for document 63 is updated 525 with the scoring function using this term frequency information.

Note that although document 63 does appear in the contribution cache 209 of "banana", this entry was not used here to score the document in this embodiment. An embodiment that does search the contribution cache 209 using the cache index 223 is described below.

Now the document score for document 63 is updated for the term "orange." As before, the lookup table 214 for "orange" is searched 519 to get the pointer to the block 205 in the term row for "orange" in the inverted index 200 that contains the (document, term frequency) tuple 204 for document 63 and the term "orange". This frequency information is then used to update 523 the document score for document 63. For the sake of illustration, the document score of document 63 is assumed to be 1.5.

Once all of the query terms have been processed, the document score for document 63 and the document are placed 525 in the result set 301, and the minimum document score 302 is updated to 1.50, here, since there is only one entry, with the document score of document 63.

The process then returns to the next contribution cache 209 at the current cache counter 305 of 1, this being document 25 in the first entry for the term "banana", and repeats, scoring document 25 on each of the query terms. Again, the lookup tables 214 of "apple" and "orange" will be respectively searched 519, and the inverted index rows for these terms traversed 521 for the (document, term frequency) tuple 204 for document 25. Next, document 61, the first document in the contribution cache 209 of "orange", will be scored in the same manner. Each time, the result set 301 and minimum document score 302 are updated. The result set 301 of Figure 4b shows assumed document scores following the scoring of these three documents, 63, 25, and 61.

Referring now to Figure 4c, once document 61, the last document in the first window of contribution cache tuples 208 for cache counter 1, has been scored, the process returns to 503, which increments the cache counter to 2. The upper bound document score 303 is set 505 to the combined sum of the contributions again, here 1.93. The minimum document score 302 of 1.50 is still less than the upper bound document score 303, so the process continues as above, this time scoring document 3, and updating the result set 301 and minimum document score 302. Referring to Figure 4d, it is assumed that document 3 has a document score of 3.36. This process continues as described, resulting in document scores for documents 3, 12 (with an assumed score of 1.65), and 77 (with an assumed score of 2.68).

As each document is scored, the result set 301 is updated 525 with the documents, and their scores, placing them in ranked order. For the purpose of this example, it is assumed that the result set 301 is constrained to the top five entries, though in practice the top 20 or 100, or some other larger limit is used. The minimum document score 302 is likewise updated 525, here to document score 1.65 from document 12, thereby eliminating document 63 with a score of 1.50 from the result set 301. This state is shown in Figure 4d.

Referring now to Figure 4e, there is illustrated the mechanism by which the minimum document score 302 and the upper bound document score 303 are used to prune the query process, while ensuring that the result set 301 has all possible documents that could have significantly meaningful document scores. In Figure 4e, the cache counter 305 is incremented 503 to the next entry (the third column), as shown, and the upper bound document score 303 is here computed 505 to be 0.98.

Also shown in Figure 4e are the contribution cache tuples in heavy outline that are at cache counter values greater than the current cache counter, but that include documents that have been previously scored. For example, in the contribution cache 209 for "orange," there appears a (document, contribution) tuple 208 for document 63. This document was the first document scored as it appeared earlier in the contribution cache 209 of "apple." The entries without outlines indicate documents that have not been scored.

Now at 507, since the upper bound document score 303 of 0.98 is less than the minimum document score 302 of 1.65, the process terminates and returns 527 the result set 301 to the user, performing any necessary post-query processing, such as obtaining document titles, locations, and the like. The result set 301 is guaranteed at this point such that no unscored document anywhere in the database, whether it has all, some, or none of the terms of the query, can have a document score greater than the upper bound document score 303.

First, of the documents at cache counter values greater than the present cache counter of 3, that is, documents to the "right" of the cache counter 305 in Figure 4e, some of these documents, the heavily outlined [...] will have been previously scored, and their document scores accounted [...] document score 302.

Thus, the only documents of interest are those that have not been scored. By (2) and (3) above, the document score is based on the contributions of the query terms. However, the contributions [...] determined and stored in the contribution caches 209. The upper bound document score 303 is the greatest possible value of the contributions of such terms in the remaining documents, since they must all have lower contribution values individually than the contributions at the current cache counter. If they had higher contribution values, then they would have been ranked higher in the contribution caches, and hence already processed and scored. If a document containing any of the terms of the query is not even present in the contribution caches, then clearly [...] terms to the document score was minor (i.e. the document did not rank in the top k entries based on any of the terms' contributions). Where a query term is missing from the document, the [...] 0, and so cannot increase the overall document score. Thus, once the minimum document score 302 from the result set 301 becomes greater than the upper bound document score 303, it is not possible for any unscored document to have a document score sufficient for inclusion in the result set. Therefore the query can be terminated.

This ability to terminate the query based on the upper bound document score and minimum document score dramatically reduces retrieval times. As will be appreciated by those of skill in [...] of the entries in the contribution cache (e.g. 1750 out of 2000 entries) of a given query term will always be less than traversing the entire term row (e.g. with 20,000 entries in a database of 1,000,000 documents) in the inverted index for such terms, both because of the fewer number of entries, and because the contribution cache is held in local memory and therefore has a considerably lower I/O cost than accesses to the inverted index, which is likely stored on disk. If the query can be terminated before the contribution cache 209 is exhausted, the retrieval times will always be better than conventional systems.

Note however, in some cases, the cache counter 305 may be incremented through all entries in the contribution cache 209. If this occurs without the upper bound document score 303 falling below the minimum document score 302, then there are documents remaining in the database that may have higher document scores than the minimum document score 302. Accordingly, one alternative is to restart 528 the query processing using conventional search techniques. The accumulated result set 301 may be passed to the conventional search technique or discarded, as query processing techniques are often highly optimized for the handling of search results.

Referring again to Figure 5a, an alternative method for overcoming the problem of documents that do not appear in the contribution caches, and that are unscored, is the cache index search 517. This approach is based on the optional use of the cache index 223. In this alternative approach, for each contribution cache 209 there is a cache index 223 as described above. Referring to Figure 5c, given the document identifier of the current document, current_doc_id, the cache index 223 is searched 517 by selecting an index entry in the cache index 223, and comparing the document identifier of the tuple 208 that the entry references in the contribution cache of the query term to the current document identifier. If there is a match, then the document score of the current document is updated directly for this query term using the contribution value from the tuple 208, as per (2), and the result set 301 is also updated 525, as shown in Figure 5a. If there is no match, then the cache index 223 is searched, using a search technique such as binary search, linear interpolation or the like to determine the next index entry to evaluate. Such search techniques are possible because, as noted above, the indices are ordered by the document identifiers of their respective tuples.

If the current document identifier is not located in the contribution cache 209 of the k-th query term, then processing continues in one of two manners. In one embodiment, the current document is not further scored on the terms of the query, but rather processing of the document is deferred 520. Since the document did not appear in the contribution cache 209 of the query term, it may be a relatively low scoring document, compared with other documents that have all of the terms of the query. Referring again to Figure 5c for deferred processing 520, a maximum document score is determined 524 for the document using the contribution from the last tuple 208 of the contribution cache of the k-th query term. The document is then stored 526 with this maximum document score. The stored documents are ordered by their maximum document scores. Processing continues with the next document. The stored documents will be later evaluated, once all of the terms of the query and documents in the contribution caches 209 have been evaluated.

Once the query terms in the contribution cache 209 have been processed, there will be a known minimum document score 302 from the result set 301. This minimum document score 302 is compared with the maximum document scores of the queued documents, and only stored documents having a maximum document score greater than the minimum document score 302 of the result set 301 need be further evaluated 526 with respect to all of the query terms. The reason for this is that if a maximum [...] then obviously this is not a document that would have been [...] be scored on the remaining query terms. Deferring processing of these documents further reduces the time needed to identify a complete set of highly relevant documents in response to the query.
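
The filtering of deferred documents against the final minimum document score might be sketched as follows. This is an illustrative sketch only; the function name `documents_to_rescore` and the representation of the deferred queue as a list of (maximum score, document identifier) pairs are assumptions made for the example.

```python
def documents_to_rescore(deferred, min_score):
    """Return the ids of deferred documents whose maximum possible score
    exceeds the minimum score in the result set, best candidates first.

    deferred -- list of (max_score, doc_id) pairs (hypothetical layout)
    """
    ordered = sorted(deferred, reverse=True)  # highest maximum score first
    return [doc_id for max_score, doc_id in ordered if max_score > min_score]
```

Documents filtered out here are never looked up in the inverted index at all, which is the source of the time saving described above.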

In an alternate embodiment, when a document does not appear in the contribution cache 209 of a term, then the contribution cache 209 can be updated 528 on demand with the term contribution for the document for the remaining terms of the query. This allows subsequent queries to be processed, having the contribution information available in the contribution caches 209.

In summary, the present invention provides an improved organization and arrangement for a document database 
along with various complementary query processing techniques that dramatically improve retrieval times, and guaran- 
tee the completeness of the search results. 

Claims 

1. In an information retrieval apparatus including a database of documents, each document having a plurality of terms and a unique document identifier, the information retrieval apparatus further including a programmed processor adapted to receive a query containing at least one term and to compute in response to the [...] for each of a selected plurality of documents, [...] computer memory readable by the processor and comprising:

a first ordered plurality of unique terms, each unique term associated in the memory with: 

a plurality of (document, term contribution) tuples, the term contribution being a scalar measure of the contribution of the term to a document score computable by the processor for the document, the tuples selected for those documents having the highest term contributions for the unique term from all documents in the database, the tuples ordered by the term contribution, such that the processor serially accesses a first subset of the tuples to compute a document score for each document in the first subset of tuples associated with a received term of a query.

2. The computer readable memory of claim 1, wherein the term contribution c_t of a unique term t to a document D is determined according to:

$$ c_t = \frac{f_{D,t} \cdot \mathrm{IDF}_t}{l_D} $$
where:

f_{D,t} is a frequency of the term t in document D;

IDF_t is an inverse document frequency of the term t in the database; and,

l_D is a normalization factor for document D.

3. The computer readable memory of claim 1, further comprising: 

an inverted index having for each unique term in the database, a plurality of (document, term frequency) tuples ordered by document; and,

in association with each of the first plurality of unique terms, a random access mechanism for accessing in the inverted index the term frequency of the unique term in any document in the tuples, to provide the processor with the term frequency for computing a document score for the document.

4. The computer readable memory of claim 3 wherein:

each of the plurality of (document, term frequency) tuples in the inverted index is arranged into a plurality p of blocks, each block having a number of (document, term frequency) tuples, each block further having a first such tuple; and,

the random access mechanism comprises, for each unique term in the first plurality of unique terms, a lookup table having a plurality of entries, each entry identifying a document [...] having a reference to a location in the inverted index of a beginning of the one block, such that any document in one of the tuples for a given unique term can be determined to be in exactly one of the plurality of p blocks in the inverted index for the same unique term.


5. The computer readable memory of claim 1, further comprising: 

for each plurality of (document, term contribution) tuples associated with a unique term, a respective plurality of indices to the tuples, the indices ordered by identifiers of the documents.

6. The computer readable memory of claim 1, wherein the plurality of tuples for a unique term is determined as a func-
tion of the number of documents having the unique term.

7. The computer readable memory of claim 1, wherein the plurality of tuples for a unique term is determined as a
function of a threshold value of the contribution of the term to the document.

8. A computer implemented method of processing a query containing a single term, the method comprising:

matching the single term to one of the plurality of unique terms in the computer readable memory of claim 1;
in the tuples associated with the matched unique term, determining for a number of the tuples, a document
score for the document in the tuple from the term contribution in the tuple; and,
returning as the results of the queries, the documents from the number of tuples that have been scored.
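
The single-term case of claim 8 reduces to reading the front of one cached list. The sketch below assumes a static cache held as a mapping from a term to (document, contribution) tuples already ordered by descending contribution. The cache contents and the function name are hypothetical.

```python
# Hypothetical static cache: term -> (document id, contribution) tuples,
# already ordered by descending contribution.
static_cache = {
    "retrieval": [("doc7", 0.91), ("doc2", 0.64), ("doc9", 0.40), ("doc4", 0.12)],
}

def single_term_query(term, limit):
    """Score the first `limit` tuples; with one term the score is the contribution itself."""
    tuples = static_cache.get(term, [])
    return [(doc_id, contribution) for doc_id, contribution in tuples[:limit]]

results = single_term_query("retrieval", limit=2)
```

Because the tuples are stored in contribution order, no sorting is needed at query time for a one-term query.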

9. A computer implemented method of processing a query containing a plurality of terms to identify documents in a
database in response to the query, the method comprising:

prior to the receipt of the query, determining for each of a plurality of terms a contribution of the term to a doc-
ument score of each of a plurality of documents, the contribution of a term based on a frequency of the term in
the document and a frequency of the term in the database;
receiving the query; and,

for each term of the query, scoring a plurality of documents from the determined contribution of at least one 
term, and from frequency information for any remaining terms of the query. 

10. The computer implemented method of claim 9, wherein determining for each of a plurality of terms a contribution 
of the term to a document score of each of a plurality of documents comprises:

selecting a plurality of unique terms, each of which appearing in more than k documents in the database;
for each of the plurality of unique terms:

determining the contribution of the term to a document score for each document containing the unique
term;

selecting a plurality of the documents; and,

storing each of the selected documents and contributions in association with the term, the documents
ordered by the contribution.

11. The computer implemented method of claim 10, wherein scoring a plurality of documents comprises, for each term
of the query:

searching in the documents stored in association with the term, and determining for each of a first subset of
the documents a document score based on the contribution of the term to the document, and upon frequency
information for other terms of the query;

ranking the documents by their document scores; and,

returning a selected number of highly ranked documents.
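
The scoring steps of claim 11 can be sketched as follows. The sketch assumes that a document score is the sum of the per-term contributions, that the contribution of each other query term is derived from its frequency in an inverted index, and that the contribution has the form frequency times inverse document frequency divided by a normalization factor. The data structures, parameter names, and sample values are hypothetical.

```python
def score_query(query_terms, static_cache, inverted_index, idf, doc_norm, subset, top_n):
    """Score the first `subset` cached documents of each query term, then rank them."""
    scores = {}
    for term in query_terms:
        for doc_id, contribution in static_cache.get(term, [])[:subset]:
            if doc_id in scores:
                continue  # already fully scored while processing another term
            total = contribution
            for other in query_terms:
                if other != term:
                    # Frequency information for the other terms comes from the inverted index.
                    freq = inverted_index.get(other, {}).get(doc_id, 0)
                    total += freq * idf[other] / doc_norm[doc_id]
            scores[doc_id] = total
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]

# Hypothetical data for two query terms and three documents.
inverted_index = {"web": {"d1": 4, "d2": 1, "d3": 2}, "search": {"d1": 2, "d3": 6}}
idf = {"web": 0.5, "search": 0.5}
doc_norm = {"d1": 2.0, "d2": 1.0, "d3": 4.0}
static_cache = {"web": [("d1", 1.0), ("d2", 0.5)], "search": [("d3", 0.75), ("d1", 0.5)]}

top = score_query(["web", "search"], static_cache, inverted_index, idf, doc_norm,
                  subset=2, top_n=2)
```

Only documents that appear near the front of some query term's cached list are scored at all, which is what keeps the amount of work small for terms that occur in a very large number of documents.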

12. The computer implemented method of claim 11, further comprising:

prior to the receipt of the query, and for each of the plurality of unique terms:

storing in an inverted index a plurality of first entries, each first entry identifying a document and having a
frequency of the unique term in the document; and,

storing in association with the unique term a lookup table including a plurality of second entries, each sec-
ond entry identifying a document in a first entry in the inverted index, and having a reference to a location
of the first entry, for obtaining from the inverted index frequency information for the term.

13. The computer implemented method of claim 9, wherein scoring a plurality of documents from the determined con-
tribution of at least one term, and from frequency information for any remaining terms of the query, further com-
prises:

for a term of the query for which the contribution of the term to documents was not determined prior to the
query:

determining the contribution of the term to a plurality of documents and storing the contribution of the term
with respect to each of the documents; and,

scoring at least one document using the newly determined contribution of the term.

14. The computer implemented method of claim 9, wherein scoring a plurality of documents further comprises: 

responsive to at least one document for which a contribution of a query term was not determined prior to the 
query, suspending scoring of the document with respect to the query term and other query terms, determining
a maximum document score for the document and storing the document with maximum document score in a
first set;

completing the scoring of other documents with respect to terms of the query, to produce a second set of doc-
uments having a minimum document score; and,

for only those documents in the first set that have a maximum document score greater than the minimum doc-
ument score, determining an actual document score for the document with respect to all of the terms of the
query.
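
The deferral in claim 14 amounts to pruning with an upper bound. The sketch below assumes that each suspended document carries a maximum possible score, that the second set is the current top n of the fully scored documents, and that an exact score is computed only when the upper bound exceeds the lowest score in that set. The function and parameter names and the sample values are hypothetical.

```python
def rank_with_deferral(complete, deferred, exact_score, top_n):
    """
    complete:    doc id -> final score (every needed contribution was available)
    deferred:    doc id -> maximum possible score (scoring was suspended)
    exact_score: callable, doc id -> actual score (the expensive step)
    """
    ranked = sorted(complete.items(), key=lambda item: item[1], reverse=True)[:top_n]
    # Minimum score of the second set; no cut-off applies while the set is not full.
    floor = ranked[-1][1] if ranked and len(ranked) == top_n else float("-inf")
    evaluated = []
    for doc_id, upper_bound in deferred.items():
        if upper_bound > floor:  # the document could still enter the results
            ranked.append((doc_id, exact_score(doc_id)))
            evaluated.append(doc_id)
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:top_n], evaluated

# Hypothetical scores: "x" may still qualify, "y" cannot.
actual = {"x": 0.6, "y": 0.1}
final, evaluated = rank_with_deferral(
    complete={"a": 0.9, "b": 0.5, "c": 0.3},
    deferred={"x": 0.7, "y": 0.2},
    exact_score=actual.get,
    top_n=2,
)
```

In this example only one of the two suspended documents is scored exactly, because the other has a maximum score that is already below the lowest score in the result set.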

15. A computer implemented method of preprocessing a database of documents for subsequent query processing,
each document having a plurality of terms, each term contained in a number of documents, comprising:

selecting a plurality of terms T in the database for which the number of documents containing term T exceeds
a threshold k, such that there are at least k documents containing term T;
for each term T:

determining for each document containing term T a contribution of term T to a document score of the doc-
ument;

ranking the documents containing term T by the contribution of term T to the document score;
selecting a plurality of highest ranked documents; and,
for each of selected documents, storing in association with the term, indicia of the document and the con-
tribution of the term T to the document.
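
The preprocessing steps of claim 15 can be sketched in a few lines. The sketch assumes an inverted index held as a mapping from a term to per-document frequencies, a contribution of the form frequency times inverse document frequency divided by a normalization factor, and a fixed number n of documents kept per term. All names and sample values are hypothetical.

```python
def build_static_cache(inverted_index, idf, doc_norm, k, n):
    """
    inverted_index: term -> {doc id: term frequency}
    idf:            term -> inverse document frequency
    doc_norm:       doc id -> normalization factor
    Returns term -> list of (doc id, contribution), highest contribution first.
    """
    cache = {}
    for term, postings in inverted_index.items():
        if len(postings) <= k:  # keep only terms found in more than k documents
            continue
        scored = [(doc_id, freq * idf[term] / doc_norm[doc_id])
                  for doc_id, freq in postings.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        cache[term] = scored[:n]  # the n highest ranked documents for the term
    return cache

# Hypothetical index: "web" is frequent enough to be cached, "rare" is not.
cache = build_static_cache(
    inverted_index={"web": {"d1": 4, "d2": 1, "d3": 2}, "rare": {"d1": 1}},
    idf={"web": 0.5, "rare": 3.0},
    doc_norm={"d1": 2.0, "d2": 1.0, "d3": 4.0},
    k=2,
    n=2,
)
```

Terms that appear in few documents are left out of the cache, since their entries in the inverted index are short enough to be read directly at query time.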

16. The computer implemented method of claim 15, further comprising: 

creating an inverted index comprising for each unique term in the database a plurality of first entries, each first 
entry identifying a document containing the term and a frequency of the term in the document, the first entries
ordered sequentially by document, the first entries arranged into a plurality of blocks, each block having an ini-
tial entry;
for each term T:

associating term T with a lookup table of second entries, each second entry referencing an initial entry of one
of the blocks in the plurality of blocks associated with term T in the inverted index.

17. The computer implemented method of claim 15, further comprising:

for each term T:

storing the indicia of the document and the contribution of the term T to the document, ordered according
to the indicia of the document;

storing a set of indices, each index identifying a respective one of the documents, the set ordered by indi-
cia of the respective documents.

18. The computer implemented method of claim 15, wherein the selected plurality of highest ranked documents is
selected as the n highest ranked documents, where n ≤ k.

19. The computer implemented method of claim 15, wherein the selected plurality of highest ranked documents is
selected as a function of a predetermined contribution threshold of a term to a document.

20. A computer memory readable by a processor in a database management system including a database of docu- 
ments, for controlling the system to preprocess the documents, the memory including computer executable instruc-
tions for causing the system to perform the steps of:

selecting unique terms in the database for which a plurality of documents containing the term exceeds a
threshold k, such that there are at least k documents containing the unique term;
for each selected term:

determining for each document containing the term a contribution of the term to a document score of the
document;

ranking the documents containing the term by the contribution of the term to the document score;

selecting a plurality of highest ranked documents; and,

for each selected document, storing in association with the term indicia of the document and the contribu-
tion of the term to the document.

21. The computer readable memory of claim 20, further including computer executable instructions for causing the sys-
tem to perform the steps of:

for each unique term:

storing the indicia of the document and the contribution of the term to the document, ordered according to
the indicia of the document;

storing a set of indices, each index identifying a respective one of the documents, the set ordered by indi-
cia of the respective documents.

22. The computer readable memory of claim 20, wherein the selected plurality of highest ranked documents is selected
as the n highest ranked documents, where n ≤ k.

23. The computer readable memory of claim 20, wherein the selected plurality of highest ranked documents is selected
as a function of a predetermined inverse document frequency threshold.

24. A computer memory readable by a processor in a database management system including a database of docu-
ments, for controlling the system to process a query for selected documents, the query including a plurality of
terms, the memory including computer executable instructions for causing the system to perform the steps of: 

prior to the receipt of the query, determining for each of a plurality of terms a contribution of the term to a doc- 
ument score of each of a plurality of documents, the contribution of a term based on a frequency of the term in 
the document and a frequency of the term in the database; 
receiving the query; and, 

for each term of the query, scoring a plurality of documents from the determined contribution of at least one 
term, and from frequency information for any remaining terms of the query. 

25. The computer readable memory of claim 24, further including computer executable instructions for causing the sys-
tem to determine for each of a plurality of terms a contribution of the term to a document score of each of a plurality
of documents by performing the steps of:

selecting a plurality of unique terms, each of which appearing in more than k documents in the database;
for each of the plurality of unique terms:

determining the contribution of the term to a document score for documents containing the unique term;
ordering the documents by the contribution of the term to the document score of the documents; and,
storing each of a plurality of documents and contributions in association with the term.

26. The computer readable memory of claim 25, further including computer executable instructions for causing the sys-
tem to score a plurality of documents, by performing the steps of:

for each term of the query, searching in the documents stored in association with the term, and determining for
each of a first subset of the documents a document score based on the contribution of the term to the docu-
ment, and upon frequency information for other terms of the query;
ranking the documents by their document scores; and,
returning a selected number of highly ranked documents.

27. The computer readable memory of claim 26, further including computer executable instructions for causing the sys-
tem to perform the steps of:

prior to the receipt of the query, and for each of the plurality of unique terms:

storing in an inverted index a plurality of first entries, each first entry identifying a document and having a
frequency of the unique term in the document; and,

storing in association with the unique term a lookup table including a plurality of second entries, each sec-
ond entry identifying a document in a first entry in the inverted index, and having a reference to a location
of the first entry, for obtaining from the inverted index frequency information for the term.

28. The computer readable memory of claim 26, further including computer executable instructions for causing the sys-
tem to score a plurality of documents by performing the steps of:

for a term of the query for which the contribution of the term to documents was not determined prior to the
query:

determining the contribution of the term to a plurality of documents and storing the contribution of the term
with respect to each of the documents; and,

scoring at least one document using the newly determined contribution of the term.

29. The computer readable memory of claim 26, further including computer executable instructions for causing the sys-
tem to score a plurality of documents by performing the steps of:

responsive to at least one document for which a contribution of a query term was not determined prior to the
query, suspending scoring of the document with respect to the query term and remaining query terms, determining
a maximum document score, and storing the document with maximum document score in a first set;
completing the scoring of documents with respect to terms of the query and other documents in the database,
to produce a second set of documents including a document having a minimum document score; and,
for only those documents in the first set that have a maximum document score greater than the minimum doc-
ument score, determining an actual document score for the document with respect to all of the terms of the
query.

30. A database management system, comprising: 

a first ordered plurality of unique terms stored in a computer readable memory, each unique term associated
in the memory with:

a plurality of (document, term contribution) tuples, the term contribution being a scalar measure of the con-
tribution of the term to a document score computable by a processor for the document, the tuples selected
as those documents having the highest term contributions for the unique term from all documents in the
database, the tuples ordered by the term contribution;

a first method executable by a processor that receives a query containing a plurality of terms and for each term
of the query, serially accesses a first subset of the tuples associated with the term, and computes for each doc-
ument in a tuple in the first subset of the tuples, a document score for the document based on the contribution
in the tuple.

31. The database management system of claim 30, further comprising: 

a preprocess method executable by a processor that selects the first plurality of unique terms from a plurality
of terms in an inverted index, and creates the tuples for a term from frequency data of the term in each of a 
number of documents in which the term appears. 